What exactly is “useful stuff” in R?

(Credits: Some of these materials in this module were adapted from Software Carpentry)

For some, it might just be basic calculations

63.24 * pi # Multiply 63.24 by pi
## [1] 198.6743
exp(x = 4.39) # Raise e to the power of 4.39
## [1] 80.64042
log(x = 1.7) # Take the log of 1.7
## [1] 0.5306283
tan(x = 58) # Compute the tangent of 58
## [1] 8.330857

For others, it might be large or complex mathematical operations

# Take one million samples from the standard normal distribution
data.sample <- rnorm(n = 1000000, mean = 0, sd = 1) 

# Build a 1000 x 1000 matrix from the sample data
big.matrix <- matrix(data = data.sample, ncol = 1000) 

dim(x = big.matrix) # Confirm that "big.matrix" is 1000 x 1000
## [1] 1000 1000
big.matrix.inverse <- solve(a = big.matrix) # Compute the inverse of "big.matrix"
system.time(expr = solve(a = big.matrix)) # Compute time required to invert "big.matrix"
##    user  system elapsed 
##   0.362   0.294   0.119

Useful Stuff: Applied Research Edition

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)

For most applied researchers, “useful stuff” that can be done in R boils down to a few core items:

  1. Manipulating data frames, e.g. filtering, summarizing, and conducting calculations across groups.
  2. Tidying data into the appropriate format
  3. Attempting to describe relationships or conduct causal inference

Dataframe Manipulation/Example Dataset

For this unit, we'll be working with the “Gapminder” dataset, which is excerpt of the data available at Gapminder.org. For each of 142 countries, the data provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

gapminder <- read.csv("../data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Dataframe Manipulation/Base R Functions

So far, you’ve seen the basics of manipulating data frames, e.g. subsetting, merging, and basic calculations. For instance, we can use base R functions to calculate summary statistics across groups of observaitons:

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15

But this isn't ideal because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

Dataframe Manipulation/dplyr

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the dplyr grammar easier to read.

Here we're going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()
  6. arrange()

If you have have not installed this package earlier, please do so now:

# not run
# install.packages('dplyr')

Now let's load the package:

library(dplyr)

Dataframe Manipulation/dplyr/select

Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.

year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

If we open up year_country_gdp, we'll see that it only contains the year, country and gdpPercap. This is equivalent to the base R subsetting function:

year_country_gdp <- gapminder[,c("year", "country", "gdpPercap")]
head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

But, as we will see, dplyr makes for much more readible, efficient code because of its pipe operator.

Dataframe Manipulation/dplyr/pipes

Above, we used what's called 'normal' grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.

year_country_gdp <- gapminder %>% select(year,country,gdpPercap)

Let's walk through it step by step. First we summon the gapminder dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don't specify which data object we use in the select() function since in gets that from the previous pipe.

Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is |. But the concept is the same!

Dataframe Manipulation/dplyr/filter

Now let's say we're only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.

year_country_gdp_euro <- gapminder %>%
    filter(continent == "Africa") %>%
    select(year,country,gdpPercap)

As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function.

To clarify, both the select and filter functions subsets the data frame. The difference is that select extracts certain columns, while filter extracts certain rows.

Note: The order of operations is very important in this case. If we used 'select' first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Dataframe Manipulation/dplyr/Calculations Across Groups

A common task you'll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculated the mean GDP per capita for each continent?

In base R, you would have to run the mean() function for each subset of data:

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])
## [1] 18621.61

That's a lot of repetition! To make matters worse, what if we wanted to add these values to our original data frame as a new column? We would have to write something like this:

gapminder$mean.continent.GDP <- NA

gapminder$mean.continent.GDP[gapminder$continent == "Africa"] <- mean(gapminder$gdpPercap[gapminder$continent == "Africa"])

gapminder$mean.continent.GDP[gapminder$continent == "Americas"] <- mean(gapminder$gdpPercap[gapminder$continent == "Americas"])

gapminder$mean.continent.GDP[gapminder$continent == "Asia"] <- mean(gapminder$gdpPercap[gapminder$continent == "Asia"])

gapminder$mean.continent.GDP[gapminder$continent == "Europe"] <- mean(gapminder$gdpPercap[gapminder$continent == "Europe"])

gapminder$mean.continent.GDP[gapminder$continent == "Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])

You can see how this can get pretty tedious, especially if we want to calculate more complicated or refined statistics. We could use loops or apply functions, but these can be difficult, slow, or error-prone.

Dataframe Manipulation/dplyr/apply-split-combine

The abstract problem we're encountering here is know as “split-apply-combine”:

We want to split our data into groups (in this case continents), apply some calculations on that group, then combine the results together afterwards.

Module 4 gave some ways to do split-apply-combine type stuff using the apply family of functions, but those too are error prone and messy.

Luckily, dplyr offers a much cleaner, straight-forward solution to this problem.

# remove this column -- there's a better way!
gapminder$mean.continent.GDP <- NULL

Dataframe Manipulation/dplyr/group_by

We've already seen how filter() can help us select observations that meet certain criteria (in the above: continent == "Europe"). More helpful, however, is the group_by() function, which will essentially use every unique criteria that we could have used in filter().

A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to the a particular value continent (at least in the example above).

Dataframe Manipulation/dplyr/summarize

The above was a bit uneventful because group_by() is much more exciting in conjunction with the summarize() function. This will allow use to create new variable(s) by using functions that repeat for each of the continent-specific data frames. In other words, using the group_by() function, we split our original dataframe into multiple pieces, which we then use to run functions (e.g. mean() or sd()) within summarize().

gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)
## Source: local data frame [5 x 2]
## 
##   continent mean_gdpPercap
## 1    Africa       2193.755
## 2  Americas       7136.110
## 3      Asia       7902.150
## 4    Europe      14469.476
## 5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent. But it gets even better – the function group_by() allows us to group by multiple variables. Let's group by year and continent.

gdp_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents_byyear)
## Source: local data frame [6 x 3]
## Groups: continent
## 
##   continent year mean_gdpPercap
## 1    Africa 1952       1252.572
## 2    Africa 1957       1385.236
## 3    Africa 1962       1598.079
## 4    Africa 1967       2050.364
## 5    Africa 1972       2339.616
## 6    Africa 1977       2585.939

That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in summarize().

gdp_pop_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
head(gdp_pop_bycontinents_byyear)
## Source: local data frame [6 x 6]
## Groups: continent
## 
##   continent year mean_gdpPercap sd_gdpPercap mean_pop   sd_pop
## 1    Africa 1952       1252.572     982.9521  4570010  6317450
## 2    Africa 1957       1385.236    1134.5089  5093033  7076042
## 3    Africa 1962       1598.079    1461.8392  5702247  7957545
## 4    Africa 1967       2050.364    2847.7176  6447875  8985505
## 5    Africa 1972       2339.616    3286.8539  7305376 10130833
## 6    Africa 1977       2585.939    4142.3987  8328097 11585184

Dataframe Manipulation/dplyr/mutate

What if we wanted to add these values to our original data frame instead of creating a new object? For this, we can use the mutate() function, which is similar to summarize() except it creates new variables to the same dataframe that you pass into it.

gapminder_with_extra_vars <- gapminder %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
head(gapminder_with_extra_vars)
## Source: local data frame [6 x 10]
## Groups: continent, year
## 
##       country year      pop continent lifeExp gdpPercap mean_gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453       5195.484
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530       5787.733
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007       5729.370
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971       5971.173
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811       8187.469
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134       7791.314
## Variables not shown: sd_gdpPercap (dbl), mean_pop (dbl), sd_pop (dbl)

We can use also use mutate() to create new variables prior to (or even after) summarizing information.

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))
head(gdp_pop_bycontinents_byyear)
## Source: local data frame [6 x 8]
## Groups: continent
## 
##   continent year mean_gdpPercap sd_gdpPercap mean_pop   sd_pop
## 1    Africa 1952       1252.572     982.9521  4570010  6317450
## 2    Africa 1957       1385.236    1134.5089  5093033  7076042
## 3    Africa 1962       1598.079    1461.8392  5702247  7957545
## 4    Africa 1967       2050.364    2847.7176  6447875  8985505
## 5    Africa 1972       2339.616    3286.8539  7305376 10130833
## 6    Africa 1977       2585.939    4142.3987  8328097 11585184
## Variables not shown: mean_gdp_billion (dbl), sd_gdp_billion (dbl)

Dataframe Manipulation/dplyr/arrange

As a last step, let's say we want to sort the rows in our data frame according to values in a certain column. We can use the arrange() function to do this. For instance, let's organize our rows by year (recent first), and then by continent.

gapminder_with_extra_vars <- gapminder %>%
    group_by(continent, year) %>%
    mutate(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop)) %>%
    arrange(desc(year), continent)
head(gapminder_with_extra_vars)
## Source: local data frame [6 x 10]
## Groups: continent, year
## 
##        country year     pop continent lifeExp gdpPercap mean_gdpPercap
## 1      Algeria 1952 9279525    Africa  43.077 2449.0082       1252.572
## 2       Angola 1952 4232095    Africa  30.015 3520.6103       1252.572
## 3        Benin 1952 1738315    Africa  38.223 1062.7522       1252.572
## 4     Botswana 1952  442308    Africa  47.622  851.2411       1252.572
## 5 Burkina Faso 1952 4469979    Africa  31.975  543.2552       1252.572
## 6      Burundi 1952 2445618    Africa  39.031  339.2965       1252.572
## Variables not shown: sd_gdpPercap (dbl), mean_pop (dbl), sd_pop (dbl)

Tidying Data

Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy

The two most important properties of tidy data are:

1) Each column is a variable. 2) Each row is an observation.

Tidy data is easier to work with, because you have a consistent way of referring to variables (as column names) and observations (as row indices). It then becomes easy to manipulate, visualize, and model.

For more on the concept of tidy data, read Hadley Wickham's paper here

Tidying Data/Wide vs. Long Formats

“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham

Tabular datasets can be arranged in many ways. For instance, consider the data below. Each data set displays information on heart rate observed in individuals across 3 different time periods. But the data are organized differently in each table.

wide <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  time1 = c(67, 80, 64),
  time2 = c(56, 90, 50),
  time3 = c(70, 67, 101)
)
wide
##      name time1 time2 time3
## 1  Wilbur    67    56    70
## 2 Petunia    80    90    67
## 3 Gregory    64    50   101
long <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
  time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 10)
)
long
##      name time heartrate
## 1  Wilbur    1        67
## 2 Petunia    1        80
## 3 Gregory    1        64
## 4  Wilbur    2        56
## 5 Petunia    2        90
## 6 Gregory    2        50
## 7  Wilbur    3        70
## 8 Petunia    3        67
## 9 Gregory    3        10

Question: Which one of these do you think is the tidy format?

Answer: The first dataframe (the “wide” one) would not be considered tidy because values (i.e., heartrate) are spread across multiple columns.

We often refer to these different structurs as “long” vs. “wide” formats. In the “long” format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the “wide” format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). In the above case, we had the same kind of data (heart rate) entered across 3 different columns, corresponding to three different time periods.

You may find data input may be simpler or some other applications may prefer the “wide” format. However, many of R’s functions have been designed assuming you have “long” format data.

Tidying Data/Gapminder

Lets look at the structure of our original gapminder dataframe:

head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Question: Is this data frame wide or long?

Answer: This data frame is somewhere in between the purely 'long' and 'wide' formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop, lifeExp, gdpPercap).

Despite not having ALL observations in 1 column, this intermediate format makes sense given that all 3 observation variables have different units. As we have seen, many of the functions in R are often vector based, and you usually do not want to do mathematical operations on values with different units.

On the other hand, there are some instances in which a purely long or wide format is ideal (e.g. plotting). Likewise, sometimes you'll get data on your desk that is poorly organized, and you'll need to reshape it.

Tidying Data/tidyr

Thankfully, the tidyr package will help you efficiently transform your data regardless of original format.

# Install the "tidyr" package (only necessary one time)
# install.packages("tidyr") # Not Run

# Load the "tidyr" package (necessary every new R session)
library(tidyr)

Tidying Data/tidyr/gather

Until now, we’ve been using the nicely formatted original gapminder dataset. This dataset is not quite wide and not quite long – it's something in the middle, but 'real' data (i.e. our own research data) will never be so well organized. Here let's start with the wide format version of the gapminder dataset.

gap_wide <- read.csv("../data/gapminder_wide.csv", stringsAsFactors = FALSE)
head(gap_wide)
##   continent      country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
## 1    Africa      Algeria      2449.0082      3013.9760      2550.8169
## 2    Africa       Angola      3520.6103      3827.9405      4269.2767
## 3    Africa        Benin      1062.7522       959.6011       949.4991
## 4    Africa     Botswana       851.2411       918.2325       983.6540
## 5    Africa Burkina Faso       543.2552       617.1835       722.5120
## 6    Africa      Burundi       339.2965       379.5646       355.2032
##   gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982
## 1      3246.9918      4182.6638      4910.4168      5745.1602
## 2      5522.7764      5473.2880      3008.6474      2756.9537
## 3      1035.8314      1085.7969      1029.1613      1277.8976
## 4      1214.7093      2263.6111      3214.8578      4551.1421
## 5       794.8266       854.7360       743.3870       807.1986
## 6       412.9775       464.0995       556.1033       559.6032
##   gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002
## 1      5681.3585      5023.2166      4797.2951      5288.0404
## 2      2430.2083      2627.8457      2277.1409      2773.2873
## 3      1225.8560      1191.2077      1232.9753      1372.8779
## 4      6205.8839      7954.1116      8647.1423     11003.6051
## 5       912.0631       931.7528       946.2950      1037.6452
## 6       621.8188       631.6999       463.1151       446.4035
##   gdpPercap_2007 lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
## 1      6223.3675       43.077       45.685       48.303       51.407
## 2      4797.2313       30.015       31.999       34.000       35.985
## 3      1441.2849       38.223       40.358       42.618       44.885
## 4     12569.8518       47.622       49.618       51.520       53.298
## 5      1217.0330       31.975       34.906       37.814       40.697
## 6       430.0707       39.031       40.533       42.045       43.548
##   lifeExp_1972 lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992
## 1       54.518       58.014       61.368       65.799       67.744
## 2       37.928       39.483       39.942       39.906       40.647
## 3       47.014       49.190       50.904       52.337       53.919
## 4       56.024       59.319       61.484       63.622       62.745
## 5       43.591       46.137       48.122       49.557       50.260
## 6       44.057       45.910       47.471       48.211       44.736
##   lifeExp_1997 lifeExp_2002 lifeExp_2007 pop_1952 pop_1957 pop_1962
## 1       69.152       70.994       72.301  9279525 10270856 11000948
## 2       40.963       41.003       42.731  4232095  4561361  4826015
## 3       54.777       54.406       56.728  1738315  1925173  2151895
## 4       52.556       46.634       50.728   442308   474639   512764
## 5       50.324       50.650       52.295  4469979  4713416  4919632
## 6       45.326       47.360       49.580  2445618  2667518  2961915
##   pop_1967 pop_1972 pop_1977 pop_1982 pop_1987 pop_1992 pop_1997 pop_2002
## 1 12760499 14760787 17152804 20033753 23254956 26298373 29072015 31287142
## 2  5247469  5894858  6162675  7016384  7874230  8735988  9875024 10866106
## 3  2427334  2761407  3168267  3641603  4243788  4981671  6066080  7026113
## 4   553541   619351   781472   970347  1151184  1342614  1536536  1630347
## 5  5127935  5433886  5889574  6634596  7586551  8878303 10352843 12251209
## 6  3330989  3529983  3834415  4580410  5126023  5809236  6121610  7021078
##   pop_2007
## 1 33333216
## 2 12420476
## 3  8078314
## 4  1639131
## 5 14326203
## 6  8390505

The first step towards getting our nice intermediate data format is to first convert from the wide to the long format. The function gather() will 'gather' the observation variables into a single variable. This is sometimes called “melting” your data, because it melts the table from wide to long. Those data will be melted into two variables: one for the variable names, and the other for the variable values.

gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, 3:38)
head(gap_long)
##   continent      country   obstype_year obs_values
## 1    Africa      Algeria gdpPercap_1952  2449.0082
## 2    Africa       Angola gdpPercap_1952  3520.6103
## 3    Africa        Benin gdpPercap_1952  1062.7522
## 4    Africa     Botswana gdpPercap_1952   851.2411
## 5    Africa Burkina Faso gdpPercap_1952   543.2552
## 6    Africa      Burundi gdpPercap_1952   339.2965

Notice that we put 3 arguments into the gather() function:

  1. the name the new column for the new ID variable (obstype_year),
  2. the name for the new amalgamated observation variable (obs_value),
  3. the indices of the old observation variables (3:38, signalling columns 3 through 38) that we want to gather into one variable. Notice that we don't want to melt down columns 1 and 2, as these are considered “ID” variables.

Tidying Data/tidyr/select

We can also select observation variables using:

See the select() function in dplyr for more options.

For instance, here we do the same thing with (1) the starts_with function, and (2) the - operator:

# with the starts_with() function
gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, starts_with('pop'),
           starts_with('lifeExp'), starts_with('gdpPercap'))
head(gap_long)
##   continent      country obstype_year obs_values
## 1    Africa      Algeria     pop_1952    9279525
## 2    Africa       Angola     pop_1952    4232095
## 3    Africa        Benin     pop_1952    1738315
## 4    Africa     Botswana     pop_1952     442308
## 5    Africa Burkina Faso     pop_1952    4469979
## 6    Africa      Burundi     pop_1952    2445618
# with the - operator
gap_long <- gap_wide %>% 
  gather(obstype_year, obs_values, -continent, -country)
head(gap_long)
##   continent      country   obstype_year obs_values
## 1    Africa      Algeria gdpPercap_1952  2449.0082
## 2    Africa       Angola gdpPercap_1952  3520.6103
## 3    Africa        Benin gdpPercap_1952  1062.7522
## 4    Africa     Botswana gdpPercap_1952   851.2411
## 5    Africa Burkina Faso gdpPercap_1952   543.2552
## 6    Africa      Burundi gdpPercap_1952   339.2965

However you choose to do it, notice that the output collapses all of the measure variables into two columns: one containing new ID variable, the other containing the observation value for that row.

Tidying Data/tidyr/separate

You'll notice that in our long dataset, obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.

We can use the separate() function to split the character strings into multiple variables:

gap_long_sep <- gap_long %>% 
  separate(obstype_year, into = c('obs_type','year'), sep = "_") %>% 
  mutate(year = as.integer(year))
head(gap_long_sep)
##   continent      country  obs_type year obs_values
## 1    Africa      Algeria gdpPercap 1952  2449.0082
## 2    Africa       Angola gdpPercap 1952  3520.6103
## 3    Africa        Benin gdpPercap 1952  1062.7522
## 4    Africa     Botswana gdpPercap 1952   851.2411
## 5    Africa Burkina Faso gdpPercap 1952   543.2552
## 6    Africa      Burundi gdpPercap 1952   339.2965

Tidying Data/tidyr/spread

The opposite of gather() is spread(). It spreads our observation variables back out to make a wider table. We can use this function to spread our gap_long() to the original “medium” format.

gap_medium <- gap_long_sep %>% 
  spread(obs_type, obs_values)
head(gap_medium)
##   continent country year gdpPercap lifeExp      pop
## 1    Africa Algeria 1952  2449.008  43.077  9279525
## 2    Africa Algeria 1957  3013.976  45.685 10270856
## 3    Africa Algeria 1962  2550.817  48.303 11000948
## 4    Africa Algeria 1967  3246.992  51.407 12760499
## 5    Africa Algeria 1972  4182.664  54.518 14760787
## 6    Africa Algeria 1977  4910.417  58.014 17152804

All we need is some quick fixes to make this dataset identical to the original gapminder dataset:

gapminder <- read.csv("../data/gapminder-FiveYearData.csv")
head(gap_medium)
##   continent country year gdpPercap lifeExp      pop
## 1    Africa Algeria 1952  2449.008  43.077  9279525
## 2    Africa Algeria 1957  3013.976  45.685 10270856
## 3    Africa Algeria 1962  2550.817  48.303 11000948
## 4    Africa Algeria 1967  3246.992  51.407 12760499
## 5    Africa Algeria 1972  4182.664  54.518 14760787
## 6    Africa Algeria 1977  4910.417  58.014 17152804
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
# rearrange columns
gap_medium <- gap_medium[,names(gapminder)]
head(gap_medium)
##   country year      pop continent lifeExp gdpPercap
## 1 Algeria 1952  9279525    Africa  43.077  2449.008
## 2 Algeria 1957 10270856    Africa  45.685  3013.976
## 3 Algeria 1962 11000948    Africa  48.303  2550.817
## 4 Algeria 1967 12760499    Africa  51.407  3246.992
## 5 Algeria 1972 14760787    Africa  54.518  4182.664
## 6 Algeria 1977 17152804    Africa  58.014  4910.417
# arrange by country, continent, and year
gap_medium <- gap_medium %>% 
  arrange(country,continent,year)
head(gap_medium)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Extra Resources

dplyr and tiyr have many more functions to help you wrangle and manipulate your data. See the Data Wrangling Cheat Sheet for more.

Describing Relationships & Causal Inference

Inference/Regression

glm(formula = y~x1+x2+x3+..., family = familyname(link = "linkname"), data = )
# Regress tip percent on total bill and party size
reg <- glm(formula = lifeExp ~ gdpPercap + pop + continent + year, 
                family = gaussian, data = gapminder)
# View objects contained in the regression output
objects(reg)
##  [1] "aic"               "boundary"          "call"             
##  [4] "coefficients"      "contrasts"         "control"          
##  [7] "converged"         "data"              "deviance"         
## [10] "df.null"           "df.residual"       "effects"          
## [13] "family"            "fitted.values"     "formula"          
## [16] "iter"              "linear.predictors" "method"           
## [19] "model"             "null.deviance"     "offset"           
## [22] "prior.weights"     "qr"                "R"                
## [25] "rank"              "residuals"         "terms"            
## [28] "weights"           "xlevels"           "y"
# Examine regression coefficients
reg$coefficients
##       (Intercept)         gdpPercap               pop continentAmericas 
##     -5.184555e+02      2.984892e-04      1.790640e-09      1.429204e+01 
##     continentAsia   continentEurope  continentOceania              year 
##      9.375486e+00      1.936120e+01      2.055921e+01      2.862583e-01
# Examine regression DoF
reg$df.residual
## [1] 1696
# Examine regression fit (AIC)
reg$aic
## [1] 11420.07
summary(reg)
## 
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent + year, family = gaussian, 
##     data = gapminder)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -28.4051   -4.0550    0.2317    4.5073   20.0217  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5.185e+02  1.989e+01 -26.062   <2e-16 ***
## gdpPercap          2.985e-04  2.002e-05  14.908   <2e-16 ***
## pop                1.791e-09  1.634e-09   1.096    0.273    
## continentAmericas  1.429e+01  4.946e-01  28.898   <2e-16 ***
## continentAsia      9.375e+00  4.719e-01  19.869   <2e-16 ***
## continentEurope    1.936e+01  5.182e-01  37.361   <2e-16 ***
## continentOceania   2.056e+01  1.469e+00  13.995   <2e-16 ***
## year               2.863e-01  1.006e-02  28.469   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 47.37935)
## 
##     Null deviance: 284148  on 1703  degrees of freedom
## Residual deviance:  80355  on 1696  degrees of freedom
## AIC: 11420
## 
## Number of Fisher Scoring iterations: 2
# Store summary method results
sum.reg <- summary(reg)
# View summary method results objects
objects(sum.reg)
##  [1] "aic"            "aliased"        "call"           "coefficients"  
##  [5] "contrasts"      "cov.scaled"     "cov.unscaled"   "deviance"      
##  [9] "deviance.resid" "df"             "df.null"        "df.residual"   
## [13] "dispersion"     "family"         "iter"           "null.deviance" 
## [17] "terms"
# View table of coefficients
sum.reg$coefficients
##                        Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)       -5.184555e+02 1.989299e+01 -26.062215 3.248472e-126
## gdpPercap          2.984892e-04 2.002178e-05  14.908225  2.522143e-47
## pop                1.790640e-09 1.634107e-09   1.095791  2.733256e-01
## continentAmericas  1.429204e+01 4.945645e-01  28.898241 1.183161e-149
## continentAsia      9.375486e+00 4.718629e-01  19.869087  3.798275e-79
## continentEurope    1.936120e+01 5.182170e-01  37.361177 2.025551e-223
## continentOceania   2.055921e+01 1.469070e+00  13.994707  3.390781e-42
## year               2.862583e-01 1.005523e-02  28.468586 4.800797e-146
glm(formula = y~x1+x2+factor(x3), family = family(link = "link"), data = )

x1:x2 interacts all terms in x1 with all terms in x2

summary(glm(formula = lifeExp ~ gdpPercap + pop + continent:factor(year), 
                family = gaussian, data = gapminder))
## 
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent:factor(year), 
##     family = gaussian, data = gapminder)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -29.5073   -3.4687    0.1739    3.5145   21.1851  
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         7.070e+01  4.745e+00  14.901  < 2e-16
## gdpPercap                           3.356e-04  1.997e-05  16.805  < 2e-16
## pop                                 8.980e-10  1.586e-09   0.566 0.571275
## continentAfrica:factor(year)1952   -3.199e+01  4.831e+00  -6.623 4.77e-11
## continentAmericas:factor(year)1952 -1.881e+01  4.919e+00  -3.823 0.000137
## continentAsia:factor(year)1952     -2.617e+01  4.872e+00  -5.371 8.94e-08
## continentEurope:factor(year)1952   -8.208e+00  4.885e+00  -1.680 0.093146
## continentOceania:factor(year)1952  -4.910e+00  6.668e+00  -0.736 0.461687
## continentAfrica:factor(year)1957   -2.991e+01  4.830e+00  -6.191 7.52e-10
## continentAmericas:factor(year)1957 -1.631e+01  4.918e+00  -3.316 0.000933
## continentAsia:factor(year)1957     -2.337e+01  4.871e+00  -4.797 1.75e-06
## continentEurope:factor(year)1957   -6.351e+00  4.883e+00  -1.301 0.193592
## continentOceania:factor(year)1957  -4.307e+00  6.667e+00  -0.646 0.518393
## continentAfrica:factor(year)1962   -2.793e+01  4.830e+00  -5.782 8.83e-09
## continentAmericas:factor(year)1962 -1.397e+01  4.917e+00  -2.840 0.004564
## continentAsia:factor(year)1962     -2.111e+01  4.871e+00  -4.333 1.56e-05
## continentEurope:factor(year)1962   -4.986e+00  4.880e+00  -1.022 0.307127
## continentOceania:factor(year)1962  -3.886e+00  6.666e+00  -0.583 0.560019
## continentAfrica:factor(year)1967   -2.606e+01  4.829e+00  -5.397 7.75e-08
## continentAmericas:factor(year)1967 -1.221e+01  4.915e+00  -2.484 0.013074
## continentAsia:factor(year)1967     -1.810e+01  4.871e+00  -3.715 0.000210
## continentEurope:factor(year)1967   -4.385e+00  4.877e+00  -0.899 0.368777
## continentOceania:factor(year)1967  -4.265e+00  6.664e+00  -0.640 0.522266
## continentAfrica:factor(year)1972   -2.404e+01  4.828e+00  -4.980 7.03e-07
## continentAmericas:factor(year)1972 -1.051e+01  4.914e+00  -2.138 0.032658
## continentAsia:factor(year)1972     -1.619e+01  4.867e+00  -3.327 0.000899
## continentEurope:factor(year)1972   -4.132e+00  4.874e+00  -0.848 0.396689
## continentOceania:factor(year)1972  -4.311e+00  6.662e+00  -0.647 0.517700
## continentAfrica:factor(year)1977   -2.200e+01  4.828e+00  -4.557 5.58e-06
## continentAmericas:factor(year)1977 -8.800e+00  4.912e+00  -1.791 0.073401
## continentAsia:factor(year)1977     -1.377e+01  4.868e+00  -2.829 0.004722
## continentEurope:factor(year)1977   -3.575e+00  4.871e+00  -0.734 0.463098
## continentOceania:factor(year)1977  -3.657e+00  6.662e+00  -0.549 0.583093
## continentAfrica:factor(year)1982   -1.995e+01  4.828e+00  -4.133 3.77e-05
## continentAmericas:factor(year)1982 -7.017e+00  4.912e+00  -1.428 0.153339
## continentAsia:factor(year)1982     -1.065e+01  4.869e+00  -2.188 0.028825
## continentEurope:factor(year)1982   -3.155e+00  4.870e+00  -0.648 0.517193
## continentOceania:factor(year)1982  -2.649e+00  6.661e+00  -0.398 0.690885
## continentAfrica:factor(year)1987   -1.813e+01  4.828e+00  -3.756 0.000179
## continentAmericas:factor(year)1987 -5.253e+00  4.911e+00  -1.070 0.284987
## continentAsia:factor(year)1987     -8.484e+00  4.869e+00  -1.743 0.081591
## continentEurope:factor(year)1987   -2.855e+00  4.868e+00  -0.587 0.557619
## continentOceania:factor(year)1987  -2.255e+00  6.660e+00  -0.339 0.734934
## continentAfrica:factor(year)1992   -1.785e+01  4.828e+00  -3.697 0.000225
## continentAmericas:factor(year)1992 -3.862e+00  4.911e+00  -0.786 0.431776
## continentAsia:factor(year)1992     -7.151e+00  4.867e+00  -1.469 0.141935
## continentEurope:factor(year)1992   -2.006e+00  4.868e+00  -0.412 0.680292
## continentOceania:factor(year)1992  -7.804e-01  6.659e+00  -0.117 0.906723
## continentAfrica:factor(year)1997   -1.792e+01  4.828e+00  -3.711 0.000213
## continentAmericas:factor(year)1997 -2.565e+00  4.910e+00  -0.522 0.601411
## continentAsia:factor(year)1997     -6.076e+00  4.865e+00  -1.249 0.211930
## continentEurope:factor(year)1997   -1.618e+00  4.866e+00  -0.332 0.739563
## continentOceania:factor(year)1997  -5.865e-01  6.658e+00  -0.088 0.929810
## continentAfrica:factor(year)2002   -1.827e+01  4.827e+00  -3.784 0.000160
## continentAmericas:factor(year)2002 -1.429e+00  4.909e+00  -0.291 0.770984
## continentAsia:factor(year)2002     -4.982e+00  4.865e+00  -1.024 0.305936
## continentEurope:factor(year)2002   -1.307e+00  4.864e+00  -0.269 0.788171
## continentOceania:factor(year)2002  -1.530e-02  6.657e+00  -0.002 0.998167
## continentAfrica:factor(year)2007   -1.695e+01  4.826e+00  -3.512 0.000457
## continentAmericas:factor(year)2007 -8.205e-01  4.906e+00  -0.167 0.867195
## continentAsia:factor(year)2007     -4.265e+00  4.862e+00  -0.877 0.380494
## continentEurope:factor(year)2007   -1.481e+00  4.862e+00  -0.305 0.760680
## continentOceania:factor(year)2007          NA         NA      NA       NA
##                                       
## (Intercept)                        ***
## gdpPercap                          ***
## pop                                   
## continentAfrica:factor(year)1952   ***
## continentAmericas:factor(year)1952 ***
## continentAsia:factor(year)1952     ***
## continentEurope:factor(year)1952   .  
## continentOceania:factor(year)1952     
## continentAfrica:factor(year)1957   ***
## continentAmericas:factor(year)1957 ***
## continentAsia:factor(year)1957     ***
## continentEurope:factor(year)1957      
## continentOceania:factor(year)1957     
## continentAfrica:factor(year)1962   ***
## continentAmericas:factor(year)1962 ** 
## continentAsia:factor(year)1962     ***
## continentEurope:factor(year)1962      
## continentOceania:factor(year)1962     
## continentAfrica:factor(year)1967   ***
## continentAmericas:factor(year)1967 *  
## continentAsia:factor(year)1967     ***
## continentEurope:factor(year)1967      
## continentOceania:factor(year)1967     
## continentAfrica:factor(year)1972   ***
## continentAmericas:factor(year)1972 *  
## continentAsia:factor(year)1972     ***
## continentEurope:factor(year)1972      
## continentOceania:factor(year)1972     
## continentAfrica:factor(year)1977   ***
## continentAmericas:factor(year)1977 .  
## continentAsia:factor(year)1977     ** 
## continentEurope:factor(year)1977      
## continentOceania:factor(year)1977     
## continentAfrica:factor(year)1982   ***
## continentAmericas:factor(year)1982    
## continentAsia:factor(year)1982     *  
## continentEurope:factor(year)1982      
## continentOceania:factor(year)1982     
## continentAfrica:factor(year)1987   ***
## continentAmericas:factor(year)1987    
## continentAsia:factor(year)1987     .  
## continentEurope:factor(year)1987      
## continentOceania:factor(year)1987     
## continentAfrica:factor(year)1992   ***
## continentAmericas:factor(year)1992    
## continentAsia:factor(year)1992        
## continentEurope:factor(year)1992      
## continentOceania:factor(year)1992     
## continentAfrica:factor(year)1997   ***
## continentAmericas:factor(year)1997    
## continentAsia:factor(year)1997        
## continentEurope:factor(year)1997      
## continentOceania:factor(year)1997     
## continentAfrica:factor(year)2002   ***
## continentAmericas:factor(year)2002    
## continentAsia:factor(year)2002        
## continentEurope:factor(year)2002      
## continentOceania:factor(year)2002     
## continentAfrica:factor(year)2007   ***
## continentAmericas:factor(year)2007    
## continentAsia:factor(year)2007        
## continentEurope:factor(year)2007      
## continentOceania:factor(year)2007     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 44.3151)
## 
##     Null deviance: 284148  on 1703  degrees of freedom
## Residual deviance:  72765  on 1642  degrees of freedom
## AIC: 11359
## 
## Number of Fisher Scoring iterations: 2

x1*x2 produces the cross of x1 and x2, or x1+x2+x1:x2

summary(glm(formula = lifeExp ~ gdpPercap + pop + continent*factor(year), 
                family = gaussian, data = gapminder))
## 
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent * factor(year), 
##     family = gaussian, data = gapminder)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -29.5073   -3.4687    0.1739    3.5145   21.1851  
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         3.871e+01  9.235e-01  41.916  < 2e-16
## gdpPercap                           3.356e-04  1.997e-05  16.805  < 2e-16
## pop                                 8.980e-10  1.586e-09   0.566 0.571275
## continentAmericas                   1.319e+01  1.621e+00   8.134 8.09e-16
## continentAsia                       5.822e+00  1.485e+00   3.920 9.22e-05
## continentEurope                     2.379e+01  1.529e+00  15.557  < 2e-16
## continentOceania                    2.708e+01  4.800e+00   5.642 1.98e-08
## factor(year)1957                    2.086e+00  1.306e+00   1.598 0.110304
## factor(year)1962                    4.067e+00  1.306e+00   3.115 0.001871
## factor(year)1967                    5.930e+00  1.306e+00   4.542 5.99e-06
## factor(year)1972                    7.948e+00  1.306e+00   6.087 1.43e-09
## factor(year)1977                    9.994e+00  1.306e+00   7.653 3.32e-14
## factor(year)1982                    1.204e+01  1.306e+00   9.221  < 2e-16
## factor(year)1987                    1.386e+01  1.306e+00  10.613  < 2e-16
## factor(year)1992                    1.414e+01  1.306e+00  10.830  < 2e-16
## factor(year)1997                    1.408e+01  1.306e+00  10.779  < 2e-16
## factor(year)2002                    1.373e+01  1.306e+00  10.511  < 2e-16
## factor(year)2007                    1.504e+01  1.306e+00  11.515  < 2e-16
## continentAmericas:factor(year)1957  4.129e-01  2.291e+00   0.180 0.857023
## continentAsia:factor(year)1957      7.150e-01  2.095e+00   0.341 0.732979
## continentEurope:factor(year)1957   -2.288e-01  2.159e+00  -0.106 0.915582
## continentOceania:factor(year)1957  -1.483e+00  6.784e+00  -0.219 0.826997
## continentAmericas:factor(year)1962  7.727e-01  2.291e+00   0.337 0.735962
## continentAsia:factor(year)1962      9.945e-01  2.095e+00   0.475 0.635119
## continentEurope:factor(year)1962   -8.452e-01  2.159e+00  -0.391 0.695499
## continentOceania:factor(year)1962  -3.043e+00  6.784e+00  -0.449 0.653798
## continentAmericas:factor(year)1967  6.632e-01  2.291e+00   0.289 0.772261
## continentAsia:factor(year)1967      2.145e+00  2.095e+00   1.024 0.306043
## continentEurope:factor(year)1967   -2.107e+00  2.160e+00  -0.976 0.329424
## continentOceania:factor(year)1967  -5.285e+00  6.784e+00  -0.779 0.436082
## continentAmericas:factor(year)1972  3.507e-01  2.291e+00   0.153 0.878376
## continentAsia:factor(year)1972      2.032e+00  2.096e+00   0.969 0.332439
## continentEurope:factor(year)1972   -3.873e+00  2.161e+00  -1.792 0.073375
## continentOceania:factor(year)1972  -7.349e+00  6.785e+00  -1.083 0.278856
## continentAmericas:factor(year)1977  1.084e-02  2.292e+00   0.005 0.996226
## continentAsia:factor(year)1977      2.404e+00  2.096e+00   1.147 0.251547
## continentEurope:factor(year)1977   -5.362e+00  2.163e+00  -2.478 0.013293
## continentOceania:factor(year)1977  -8.742e+00  6.785e+00  -1.288 0.197779
## continentAmericas:factor(year)1982 -2.520e-01  2.292e+00  -0.110 0.912450
## continentAsia:factor(year)1982      3.479e+00  2.096e+00   1.660 0.097163
## continentEurope:factor(year)1982   -6.988e+00  2.165e+00  -3.227 0.001276
## continentOceania:factor(year)1982  -9.780e+00  6.785e+00  -1.441 0.149674
## continentAmericas:factor(year)1987 -3.056e-01  2.292e+00  -0.133 0.893940
## continentAsia:factor(year)1987      3.829e+00  2.096e+00   1.827 0.067953
## continentEurope:factor(year)1987   -8.505e+00  2.169e+00  -3.922 9.14e-05
## continentOceania:factor(year)1987  -1.120e+01  6.786e+00  -1.651 0.098952
## continentAmericas:factor(year)1992  8.020e-01  2.292e+00   0.350 0.726462
## continentAsia:factor(year)1992      4.878e+00  2.097e+00   2.326 0.020134
## continentEurope:factor(year)1992   -7.940e+00  2.168e+00  -3.662 0.000258
## continentOceania:factor(year)1992  -1.001e+01  6.786e+00  -1.475 0.140318
## continentAmericas:factor(year)1997  2.164e+00  2.292e+00   0.944 0.345341
## continentAsia:factor(year)1997      6.019e+00  2.098e+00   2.869 0.004174
## continentEurope:factor(year)1997   -7.487e+00  2.172e+00  -3.446 0.000582
## continentOceania:factor(year)1997  -9.753e+00  6.788e+00  -1.437 0.150990
## continentAmericas:factor(year)2002  3.649e+00  2.293e+00   1.591 0.111701
## continentAsia:factor(year)2002      7.461e+00  2.099e+00   3.555 0.000388
## continentEurope:factor(year)2002   -6.827e+00  2.178e+00  -3.134 0.001753
## continentOceania:factor(year)2002  -8.833e+00  6.791e+00  -1.301 0.193515
## continentAmericas:factor(year)2007  2.942e+00  2.294e+00   1.283 0.199720
## continentAsia:factor(year)2007      6.864e+00  2.101e+00   3.267 0.001108
## continentEurope:factor(year)2007   -8.316e+00  2.187e+00  -3.803 0.000148
## continentOceania:factor(year)2007  -1.013e+01  6.793e+00  -1.492 0.135983
##                                       
## (Intercept)                        ***
## gdpPercap                          ***
## pop                                   
## continentAmericas                  ***
## continentAsia                      ***
## continentEurope                    ***
## continentOceania                   ***
## factor(year)1957                      
## factor(year)1962                   ** 
## factor(year)1967                   ***
## factor(year)1972                   ***
## factor(year)1977                   ***
## factor(year)1982                   ***
## factor(year)1987                   ***
## factor(year)1992                   ***
## factor(year)1997                   ***
## factor(year)2002                   ***
## factor(year)2007                   ***
## continentAmericas:factor(year)1957    
## continentAsia:factor(year)1957        
## continentEurope:factor(year)1957      
## continentOceania:factor(year)1957     
## continentAmericas:factor(year)1962    
## continentAsia:factor(year)1962        
## continentEurope:factor(year)1962      
## continentOceania:factor(year)1962     
## continentAmericas:factor(year)1967    
## continentAsia:factor(year)1967        
## continentEurope:factor(year)1967      
## continentOceania:factor(year)1967     
## continentAmericas:factor(year)1972    
## continentAsia:factor(year)1972        
## continentEurope:factor(year)1972   .  
## continentOceania:factor(year)1972     
## continentAmericas:factor(year)1977    
## continentAsia:factor(year)1977        
## continentEurope:factor(year)1977   *  
## continentOceania:factor(year)1977     
## continentAmericas:factor(year)1982    
## continentAsia:factor(year)1982     .  
## continentEurope:factor(year)1982   ** 
## continentOceania:factor(year)1982     
## continentAmericas:factor(year)1987    
## continentAsia:factor(year)1987     .  
## continentEurope:factor(year)1987   ***
## continentOceania:factor(year)1987  .  
## continentAmericas:factor(year)1992    
## continentAsia:factor(year)1992     *  
## continentEurope:factor(year)1992   ***
## continentOceania:factor(year)1992     
## continentAmericas:factor(year)1997    
## continentAsia:factor(year)1997     ** 
## continentEurope:factor(year)1997   ***
## continentOceania:factor(year)1997     
## continentAmericas:factor(year)2002    
## continentAsia:factor(year)2002     ***
## continentEurope:factor(year)2002   ** 
## continentOceania:factor(year)2002     
## continentAmericas:factor(year)2007    
## continentAsia:factor(year)2007     ** 
## continentEurope:factor(year)2007   ***
## continentOceania:factor(year)2007     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 44.3151)
## 
##     Null deviance: 284148  on 1703  degrees of freedom
## Residual deviance:  72765  on 1642  degrees of freedom
## AIC: 11359
## 
## Number of Fisher Scoring iterations: 2

Inferences/Regression Diagnostics

bptest(reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 174.08, df = 7, p-value < 2.2e-16
bgtest(reg)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  reg
## LM test = 1241.4, df = 1, p-value < 2.2e-16
dwtest(reg)
## 
##  Durbin-Watson test
## 
## data:  reg
## DW = 0.29392, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
coeftest(x = reg, vcov. = vcovHC)
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)       -5.1846e+02  2.5634e+01 -20.2249 < 2.2e-16 ***
## gdpPercap          2.9849e-04  4.8328e-05   6.1763 6.561e-10 ***
## pop                1.7906e-09  1.3057e-09   1.3714    0.1703    
## continentAmericas  1.4292e+01  5.2624e-01  27.1588 < 2.2e-16 ***
## continentAsia      9.3755e+00  5.6555e-01  16.5776 < 2.2e-16 ***
## continentEurope    1.9361e+01  6.9851e-01  27.7177 < 2.2e-16 ***
## continentOceania   2.0559e+01  1.0739e+00  19.1441 < 2.2e-16 ***
## year               2.8626e-01  1.3013e-02  21.9982 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Breakout and overnight homework

dplyr

  1. Use dplyr to create a data frame containing the median lifeExp for each continent

  2. Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.

  3. Use dplyr to: add a column called gdpPercap_diff that contains the difference between the observation's gdpPercap and the mean gdpPercap of the continent in that year. Arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)

hint: You might have to ungoup() before you arrange().

tidyr

  1. Subset the results from question #3 to select only the country, year, and gdpPercap_diff columns. Use tidyr put it in wide format so that countries are rows and years are columns.

regressions

  1. Fit two linear regression models from the gapminder data, where the outcome is gdpPercap and the explanatory variables are pop, lifeExp, and year. In one model, treat year as a numeric variable. In the other, factorize the year variable. How do you interpret each model?

  2. Fix a logit model where the outcome is whether gdpPercap_diff is positive or negative – that is, whether an observation is in the upper half or lower half of the continent's wealth in a given year. The explanatory variables should be country, lifeExp, and pop.